fix: Fix KVBM GPU memory leak #4171
base: main
Signed-off-by: krishung5 <[email protected]>
Walkthrough
The changes enhance slot lifecycle management in the vLLM block manager connector. Updates include acquiring locks before marking slots finished, conditional slot removal based on state (Finished vs. Finishing), early-exit guards during finalization, explicit typing for transfer channels, and synthetic completion notifications on offload failures to ensure progress in error scenarios.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (1)
992-996: Early-exit guard prevents operations during finalization.
The guard correctly prevents new operations from being created when the slot is Finishing or Finished, addressing the race condition described in Issue 1. However, consider logging a debug message when this path is taken to aid observability and debugging, since callers may not be aware the operation was skipped.
Consider adding a debug log:
  // Check if slot is in Finishing state before creating operations
  // If we're finishing, don't create new operations
  if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
+     tracing::debug!(
+         request_id = %self.request_id,
+         "skipping offload_blocks because slot is {:?}",
+         self.state
+     );
      return Ok(());
  }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- lib/kvbm/src/block_manager/vllm/connector/leader.rs (1 hunks)
- lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader.rs
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: The codebase has a robust two-layer transfer management system: TransferBatcher (offload/pending.rs:400-485) handles batching large transfers into MAX_TRANSFER_BATCH_SIZE chunks, and LocalTransferManager (offload/pending.rs:280-286) limits concurrency to MAX_CONCURRENT_TRANSFERS using FuturesUnordered.
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-04T18:43:04.566Z
Learnt from: jthomson04
Repo: ai-dynamo/dynamo PR: 1363
File: lib/llm/src/block_manager/block/transfer.rs:206-216
Timestamp: 2025-06-04T18:43:04.566Z
Learning: For NIXL transfers in the KVBM system, the future returned by `nixl::write_blocks_to` is independent of the underlying transfer execution. The transfer begins immediately when `nixl::write_blocks_to` is called, and the returned future is only used for notification/completion tracking. Therefore, it's safe to drop the future when notification is not needed (`notify == false`).
Applied to files:
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (1)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4)
request_id (94-94), request_id (431-433), state (96-96), state (435-437)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (3)
1182-1183: LGTM: Explicit channel typing improves clarity.
The explicit typing of transfer channels enhances code readability and type safety without altering functionality.
1222-1223: LGTM: Captures IDs for synthetic completion notification.
Correctly captures `request_id` and `operation_id` before the request is moved, enabling the synthetic completion pathway on failure.
1234-1258: Synthetic completion notification with empty blocks is safe and correctly addresses the failure scenario.
The implementation properly handles offload processing failures:
- Empty `BlockTransferRequest` with `RequestType::Immediate` completes instantly through the completion channel
- `ImmediateTransferCompletionHandle` sends the completion notification regardless of block count, ensuring the scheduler's completed counter increments even on failure
- The Drop impl (protocol.rs:276-288) enforces that `mark_complete()` must be called, preventing silent completion tracking failures
- Silent error handling (lines 1255-1257) is acceptable since the error is already logged above
The solution correctly ensures scheduler progress even when offload processing fails.
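The point about the Drop impl enforcing `mark_complete()` can be illustrated with a small drop-guard sketch. This is not the code from protocol.rs: `CompletionHandle`, its fields, and the `missed` counter are hypothetical names chosen for illustration, and the real implementation may panic or log instead of counting.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a transfer completion handle.
/// Not the actual type from protocol.rs.
pub struct CompletionHandle {
    completed: Arc<AtomicUsize>, // assumed scheduler-side "completed" counter
    missed: Arc<AtomicUsize>,    // handles dropped without being completed
    done: bool,
}

impl CompletionHandle {
    pub fn new(completed: Arc<AtomicUsize>, missed: Arc<AtomicUsize>) -> Self {
        Self { completed, missed, done: false }
    }

    /// Consumes the handle and reports completion exactly once.
    pub fn mark_complete(mut self) {
        self.completed.fetch_add(1, Ordering::SeqCst);
        self.done = true;
        // `self` is dropped here; Drop sees `done == true` and stays quiet.
    }
}

impl Drop for CompletionHandle {
    fn drop(&mut self) {
        if !self.done {
            // A real implementation might panic or log loudly here;
            // this sketch only records the protocol violation.
            self.missed.fetch_add(1, Ordering::SeqCst);
        }
    }
}
```

Because `mark_complete` takes `self` by value, a handle cannot be completed twice, and any handle that goes out of scope without it is caught in `drop`.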
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
510-518: LGTM: Lock acquisition prevents race condition.
Acquiring the lock before marking the slot as finished (line 518) is critical to preventing the race condition described in Issue 1. This ensures that the state transition to `Finishing` happens atomically with respect to any operation creation attempts in `offload_blocks()`. Combined with the early-exit guard added to `slot.rs` (lines 992-996), this prevents workers from missing the finish event because the slot was removed before they could process pending operations.
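A minimal sketch of how the lock and the early-exit guard interact, assuming a heavily simplified slot. The `Slot` struct, the `pending_ops` field, the `Active` variant, and the boolean return value are hypothetical; only the `Finishing`/`Finished` states and the `matches!` guard come from the PR.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotState {
    Active, // hypothetical name for "request still running"
    Finishing,
    Finished,
}

/// Hypothetical, simplified slot; the real slot carries much more state.
pub struct Slot {
    pub state: SlotState,
    pub pending_ops: usize,
}

impl Slot {
    /// Finishing if operations are still outstanding, Finished otherwise.
    pub fn mark_as_finished(&mut self) {
        self.state = if self.pending_ops > 0 {
            SlotState::Finishing
        } else {
            SlotState::Finished
        };
    }

    /// Returns true if a new operation was created, false if it was skipped.
    pub fn offload_blocks(&mut self) -> bool {
        // Early-exit guard: no new operations once the slot is finishing.
        if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
            return false;
        }
        self.pending_ops += 1;
        true
    }
}

/// Leader-side finish path: take the lock first, then transition the state,
/// so no offload can slip in between the check and the transition.
pub fn request_finished(slot: &Arc<Mutex<Slot>>) -> SlotState {
    let mut guard = slot.lock().unwrap();
    guard.mark_as_finished();
    guard.state
}
```

Since both `offload_blocks` and the state transition run under the same mutex, an offload either lands before the transition (and is counted as pending) or observes `Finishing` and is skipped.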
527-538: Code is correct - no changes needed.
The implementation properly handles the `Finishing` state and prevents memory leaks:
- Worker's `is_complete()` checks that all operations are complete by comparing completed count to total operations
- The `remove_slot()` method includes a defensive assertion requiring `slot.is_complete()` to be true before removal
- Leader extracts pending operations via `take_pending_operations()` and adds them to the scheduler so they complete before `is_complete()` returns true
- Worker panics if a slot unexpectedly disappears while in the maybe_finished set, catching protocol violations
The conditional removal logic correctly differentiates between `Finishing` (operations pending) and `Finished` (all resources released), ensuring blocks remain available while workers process pending operations.
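The conditional removal described above could look roughly like the following. `SlotTable` and `remove_if_finished` are hypothetical names, and the slot is reduced to just its state; the actual leader keeps richer per-request data.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotState {
    Finishing,
    Finished,
}

/// Hypothetical slot table keyed by request id.
pub struct SlotTable {
    slots: HashMap<String, SlotState>,
}

impl SlotTable {
    pub fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    pub fn insert(&mut self, request_id: &str, state: SlotState) {
        self.slots.insert(request_id.to_string(), state);
    }

    pub fn contains(&self, request_id: &str) -> bool {
        self.slots.contains_key(request_id)
    }

    /// Removes the slot only when it is Finished. A Finishing slot stays
    /// alive so workers can still process its pending operations.
    /// Returns whether the slot was removed.
    pub fn remove_if_finished(&mut self, request_id: &str) -> bool {
        let state = self.slots.get(request_id).copied();
        if state == Some(SlotState::Finished) {
            self.slots.remove(request_id);
            true
        } else {
            false
        }
    }
}
```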
ziqifan617
left a comment
awesome work!
// Notify scheduler that this operation is "complete" (even though it failed)
// Create a fake/immediate transfer request that completes instantly
// This increments the workers' completed counter so they can progress
Suggested change:
- // Notify scheduler that this operation is "complete" (even though it failed)
- // Create a fake/immediate transfer request that completes instantly
- // This increments the workers' completed counter so they can progress
+ // Create a fake/immediate transfer request that completes instantly.
+ // Otherwise, the worker side might get stuck and cause a memory leak.
// Mark the slot as finished (sets state to Finishing if there are operations,
// or Finished if all operations are complete)
slot.mark_as_finished(self.iteration_counter)?;
Hmm, from reading the code, I am wondering where the slot would be marked as `SlotState::Finished`. I only saw places that set it to `SlotState::Finishing`. Will follow up offline.
Thanks for the catch. Resolved offline. Fixed in the latest commit
Signed-off-by: krishung5 <[email protected]>
Overview:
There are two separate but related fixes addressing different aspects of the memory leak/hang problem:
Issue 1: Race Condition in Finish Event Communication
When vLLM finishes a request, it calls `request_finished()` on the leader. The leader and workers must coordinate to clean up GPU blocks, but there was a race condition where the leader would delete the slot before workers could process the finish event.
Fix:
- Keep Slots Alive in Finishing State - changes in `leader.rs` ([...] `Finishing` state)
- Prevent New Operations After Finishing - changes in `slot.rs`: in `offload_blocks`, if we're finishing, don't create new operations.
Issue 2: Offload operations that failed didn't notify the scheduler
When an offload operation fails (e.g., "Not enough blocks available" due to CPU cache being full), the error is logged but no completion notification is sent to the scheduler. This leaves the scheduler's `completed` counter stuck, preventing slots from ever being marked as complete.
Fix:
- Notify scheduler that this operation is "complete" - changes in `slot.rs`
Closes DIS-848
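To make the Issue 2 fix concrete, here is a minimal sketch of "notify on failure" using a standard-library channel. `Completion`, `process_offload`, and `drain_completed` are hypothetical names; the actual code sends an empty immediate transfer request through the scheduler rather than a plain message.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical completion message; the real code identifies operations
/// by request id and operation id as well, but with different types.
#[derive(Debug, PartialEq, Eq)]
pub struct Completion {
    pub request_id: String,
    pub operation_id: u64,
}

/// Runs an offload and notifies the scheduler on success AND on failure,
/// so the completed counter can always catch up with the operation count.
pub fn process_offload(
    request_id: &str,
    operation_id: u64,
    offload: impl FnOnce() -> Result<(), String>,
    completions: &Sender<Completion>,
) -> Result<(), String> {
    let result = offload();
    if let Err(e) = &result {
        // The failure is logged once here.
        eprintln!("offload failed for {request_id}: {e}");
    }
    // Synthetic completion: sent regardless of the outcome. A send error is
    // ignored because it only means the receiving side has already gone away.
    let _ = completions.send(Completion {
        request_id: request_id.to_string(),
        operation_id,
    });
    result
}

/// Counts the completions currently queued for the scheduler.
pub fn drain_completed(rx: &Receiver<Completion>) -> usize {
    rx.try_iter().count()
}
```

With this shape, one successful and one failed offload both produce a completion, so a slot waiting for two operations is not left stuck by the failure.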